perf: use BMI2 bit deposit and extract helpers#779
Conversation
Change-Id: I4bdf73fd74f7b4d8d699124cdca2c2bd7b292e8b
Change-Id: I1fe81d656fe85b0c893e4baf0562c871bc275b81
|
Neat! Could you prepare a quick benchmark to demonstrate the performance benefit? Some candidate sizes are mentioned in the issue. Single-threaded, CPU-only (no GPU), single-process is sufficient! For ease of cross-platform benchmarking, you can define a benchmarking script which just prints the timings, placed in the To get a comparison with the original code, we can first get the benchmark running in CI with the new BMI stuff in this PR. Then we can open a new PR with the same benchmarking example and no other changes, to get the old numbers. A simple benchmarker is shown in this example EDIT: given the BMI are used when inputs are sorted, you can ctrl-F for |
|
Thanks, I prepared a quick single-threaded, CPU-only microbenchmark for the two helper paths. My local machine is Apple Silicon, so I used a temporary fork branch and a GitHub-hosted native x86_64 runner with BMI2 exposed:
So the clearest win in this quick check is the candidate This is intentionally just a helper-level microbenchmark rather than a full simulation throughput claim; memory bandwidth and surrounding kernel work can still dilute the effect in end-to-end operations. I kept the benchmark on the temporary fork branch rather than adding it to this PR, but I can add or adjust a benchmark if you would like it kept in-tree. |
Change-Id: Iea9e065dafcbd514b16353eb1df9cc0b3b24f879
|
Following up on your suggestion, I added an in-tree automated benchmark in
Local verification on my Apple Silicon machine: That local run built and printed fallback timings successfully. On x86_64 GCC/Clang runners where |
|
The previously run CI confirms no compilation issues have been introduced on any backend, so I've tailored the CI to run only what's relevant for this diff (single-CPU, automated examples) |
|
The CI results: Only Linux used the intrinsincs (MacOS did not). Is this expected? |
Change-Id: Ia987f8b01088101ddbb8f43c180c19f3e07c085c
|
Yes, this is expected for the current conservative gate. The BMI2 path is only enabled when the target compile actually defines That explains the current CI shape:
I also pushed |
|
One more CI note: the latest runs for |
|
Yes I understand the fallback - put the human on for a second :) You mentioned you were running on an "Apple Silicon machine", initially suggesting you couldn't use BMI2 (i.e. are using ARM) so spun up the x86 image, but subsequently you claim to have "locally verified" BMI2 on your Apple same machine. Which is it? I've updated the CI to include x86 MacOS (if those images are themselves not an AI hallucination - God help us!) |
|
Fair callout. My wording was muddy there. The machine on my desk is Apple Silicon / arm64. On that machine I only built and ran the fallback path; it cannot natively execute BMI2. When I said the Apple Silicon local verification passed for the in-tree benchmark, I meant: the benchmark builds and runs locally and correctly prints the fallback timings / The BMI2 performance numbers came from the separate GitHub-hosted native x86_64 Ubuntu runner in my fork workflow, not from the Apple Silicon machine. Earlier x86_64 So the intended split is:
Sorry for making that sound more mysterious than it was. |
Change-Id: I54386dfde9f0cf209e54e79c80937170e35e622b
|
I also pushed That fixes the two accidental side effects from adding
The refreshed runs for |
|
I've re-triggered the tests. Can you repeat back to me what the next step will be, as earlier discussed? |
|
Yes. The next step is to create a separate baseline PR that adds the same That baseline PR should let CI run the identical benchmark against the current fallback/original implementation. Then we can compare:
That should give a like-for-like CI comparison for the old and new helper paths before making a final call on this optimization. |
Summary
Closes #717.
This adds guarded x86 BMI2 fast paths for QuEST's hot bit insertion/extraction helpers:
_pdep_u64forinsertBits(...)when BMI2 is available;_pdep_u64directly ininsertBitsWithMaskedValues(...), avoiding the extra loop in its heavily used masked path;_pext_u64ingetValueOfBits(...)only whenbitIndicesare strictly increasing, preserving the existing arbitrary-order behavior by falling back to the loop otherwise;INLINEhost/device functions remain kernel-compatible.I also added internal unit coverage for:
Validation
Local validation run on Apple Silicon:
# Syntax-checked the BMI2 branch by targeting x86_64 with -mbmi2. c++ -target x86_64-apple-macos15 -mbmi2 -std=c++20 \ -I/tmp/quest_cfg -I. -x c++ -fsyntax-only -The targeted
bitwisetest run passed with 8 assertions in 3 test cases.I also ran:
This completed with 266/280 tests passing. The 14 non-passing tests were all 120-second timeouts in existing exhaustive multi-controlled operation tests on this machine; there were no assertion failures. The new bitwise tests are CTest entries 1-3 and all passed.
AI Assistance Disclosure
I used Codex as a coding assistant to inspect the existing bitwise implementation, draft the guarded intrinsic path, and prepare local validation commands. I reviewed the issue requirements and the resulting diff, preserved the arbitrary-order semantics of
getValueOfBits(...), and ran the validation listed above before submitting.Closes #717